Abstract

We present a new approach for protecting sensitive data
in a relational table (columns: attributes; rows: records).
If sensitive data can be inferred by unauthorized users
from non-sensitive data, we have the inference problem.
We consider inference as correct classification and
approach it with decision tree methods. As in our previous
work, sensitive data are viewed as the classes of the
test data, and non-sensitive data are the remaining attribute
values. In general, however, sensitive data may not
be associated with one attribute (i.e., the class) but
may be distributed among many attributes. We present a
generalized decision tree method for distributed sensitive
data. This method takes each attribute in turn as
the class and analyzes the corresponding classification
error. Attribute values that maximize an integrated error
measure are selected for modification. Our analysis
shows that modified attribute values can be restored and,
hence, that sensitive data are not securely protected. This
result implies that modified values must themselves be
subject to protection. We present methods for this
ramified protection problem and also discuss other
statistical attacks.

1  Introduction 

Information sharing and data disclosure have led to
unprecedented demand for effective sensitive data
protection methods. If sensitive data can be inferred by
unauthorized users from non-sensitive data, we have the
inference problem. In this paper, we apply the decision
tree method ([5]) to inference prevention and sensitive
data protection. The decision tree method conveniently
provides a more localized description of data records.
We assume that the data set is in the form of a
relational table (columns: attributes; rows: records) and
contains two parts: a training part and a test part
(Table 1). This application of the decision tree method
was first discussed in [1], where sensitive data were
represented as values of the class-label (i.e., the class
attribute) of the test data. However, sensitive data may
not be restricted to one particular attribute, and in many
data sets the class-label is not specified. We consider the
case where sensitive data are distributed over the entire
data set, and we extend the decision tree method to
handle distributed sensitive data.

2  Inference  Problem 

We consider a simple two-level security protocol with
High and Low users. The High users (e.g., the
database manager) view the entire database, and the
Low users share the High view with the exception of
any confidential data. When data are shared, High
releases some of the non-sensitive data to Low. In the
pre-processing for data release, sensitive data are
replaced by “?”s. It is well known that removing sensitive
data alone is insufficient to prevent inference and that
some non-sensitive data must also be modified, for
example by blocking (e.g., [3]).

Inference prevention proceeds as follows ([2]). High
generates rules from the available data set and then
determines whether there is inference based on those
rules. If the inference is excessive, High implements
a protection plan to lessen the inference (i.e., it decides
which data to modify by deleting them from the database
as it appears to Low). The output of our inference model is
the data set that can be released to Low. Our goal is to
make modifications as parsimoniously as possible and
thus avoid imposing unnecessary changes that lessen
functionality.

Table 1: Relational Table for Evaluation. A_j denotes
the jth attribute and “?” denotes an unknown
value, a confidential datum, or a previously
modified value.

3  Decision  Tree  Method

The class-label attribute in conventional decision tree
methods is deterministic. To deal with inference in
the presence of distributed sensitive data, any attribute
may be considered as the class-label (the original
class-label thereby becomes an ordinary attribute).

As in our previous work ([1]), sensitive data are
viewed as the classes of the test data, and non-sensitive
data are the remaining attribute values. We consider
inference as correct classification: the lower the correct
classification rate, the higher the security of the data. To
prevent inference, we increase the classification error
(equivalently, decrease the correct classification rate) by
modifying the set of attribute values of non-sensitive data
that yields the largest increase in classification error.
We formalize the requirements in the next section.
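
The idea of posting each attribute in turn as the class-label and
measuring the resulting classification error can be sketched as follows.
This is a minimal illustration that assumes an ID3-style tree built with
information gain; the function names, the tree representation, and the
two test records are hypothetical and are not taken from the paper. The
three training records are the complete rows of the submarine sample
used in the example section.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, attrs, target):
    """Tiny ID3-style tree: a leaf is a class value, a node is a tuple."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority

    def gain(attr):
        remainder = 0.0
        for value, count in Counter(r[attr] for r in rows).items():
            subset = [r[target] for r in rows if r[attr] == value]
            remainder += count / len(rows) * entropy(subset)
        return entropy(labels) - remainder

    best = max(attrs, key=gain)  # most informative attribute becomes the node
    rest = [a for a in attrs if a != best]
    children = {
        value: build_tree([r for r in rows if r[best] == value], rest, target)
        for value in {r[best] for r in rows}
    }
    return (best, children, majority)

def classify(tree, row):
    """Walk the tree; fall back to the node majority for unseen values."""
    while isinstance(tree, tuple):
        attr, children, majority = tree
        if row[attr] not in children:
            return majority
        tree = children[row[attr]]
    return tree

def error_per_class_label(train, test, attrs):
    """Post each attribute in turn as the class-label; return test error."""
    errors = {}
    for target in attrs:
        others = [a for a in attrs if a != target]
        tree = build_tree(train, others, target)
        wrong = sum(classify(tree, r) != r[target] for r in test)
        errors[target] = wrong / len(test)
    return errors

ATTRS = ["diesels", "range", "depth"]
# Complete records of the submarine sample (training part).
TRAIN = [
    {"diesels": "low", "range": "short", "depth": "medium"},
    {"diesels": "medium", "range": "medium", "depth": "medium"},
    {"diesels": "high", "range": "long", "depth": "deep"},
]
# Hypothetical test records with the true values as High would see them.
TEST = [
    {"diesels": "medium", "range": "medium", "depth": "medium"},
    {"diesels": "low", "range": "medium", "depth": "medium"},
]

errors = error_per_class_label(TRAIN, TEST, ATTRS)
print(errors)
```

A low error for an attribute means that Low can classify (infer) its
hidden values well, so that attribute is a candidate for protection.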

4  Metric 

Data modification will most likely degrade data
performance. Important metrics in data modification
are the effectiveness measure of sensitive data
protection (E) and the measure of the loss of
functionality (F) in a data set. In terms of the decision
tree method, the effectiveness measure (w.r.t. the
current class-label) is determined by the classification
error of the test data (i.e., the confidential data), while
the measure of loss of functionality is a function of the
classification error of the training data (i.e., the
to-be-released data).

Suppose the jth attribute is posted as the class-label.
Let the measure of protection effectiveness with respect
to the jth attribute be denoted as E_j and the measure
of the loss of functionality be denoted as F_j. The
overall measures E and F for the entire database are
functions (e.g., weighted averages) of the E_j's and F_j's.
The measure of the loss of functionality F usually has an
upper bound given by a threshold v (i.e., F < v) that
represents the maximum level of information loss that
users are willing to tolerate. With the definitions of E
and F in mind, our optimization goal is to

Minimize  E,  while  keeping  F  <  v, 

i.e., we optimize E subject to the bound on F
(the optimization criterion). Note that the effect of
protection is evaluated from High’s perspective, while the
database functionality is evaluated from Low’s view.
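
As an illustration of how the per-attribute measures might be combined,
the sketch below aggregates E_j and F_j by a weighted average and checks
the bound F < v. The weights, the numeric values, and the function names
are assumptions made for this example; the paper only states that the
overall measures are some function (e.g., a weighted average) of the
per-attribute ones.

```python
def weighted_average(values, weights):
    """Combine per-attribute measures into one overall measure."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total

def within_tolerance(f_overall, v):
    """True when the loss of functionality stays below the threshold v."""
    return f_overall < v

# Hypothetical per-attribute measures for three attributes.
E_j = [0.6, 0.4, 0.8]      # protection effectiveness per attribute
F_j = [0.10, 0.20, 0.05]   # loss of functionality per attribute
weights = [2, 1, 1]        # assumed weights, e.g. number of sensitive values

E = weighted_average(E_j, weights)
F = weighted_average(F_j, weights)
print(E, F, within_tolerance(F, v=0.15))
```

A candidate modification would be kept only if `within_tolerance`
holds for the resulting F.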

5  Modification  Control 

In theory, the optimal set of attribute values for
modification can be determined by exhaustively evaluating
every possible batch of attribute values of
non-confidential data and selecting the batch that scores
the highest with respect to a given optimization
criterion. Such an exhaustive search is impractical when
the volume of the data is large. Instead of evaluating
each attribute in turn, we prioritize the attributes and
evaluate the one that poses the highest threat. During
modification, we visit the attribute that has the
largest number of sensitive data records and the lowest
classification error, i.e., the highest inference threat.
(Prioritization needs to be carried out for each run of
modification.) Let the total number of confidential
attribute values be denoted as S, and let the classification
error of the test data with respect to the jth attribute
be Cr_j. We select from all the M attributes the one
that maximizes the product of the number of associated
confidential data records (normalized by S) and the
complement of the classification error,

$$\max_{1 \le j \le M} \; (1 - Cr_j) \left( \frac{TE_j}{S} \right),$$

where TE_j is the number of test data associated with the
jth attribute. Handling attributes that pose a high
inference threat first allows us to achieve the necessary
level of protection more effectively.
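
The prioritization rule can be sketched in a few lines. The score is the
product of the correct-classification rate (one minus the classification
error) and the fraction of test data attached to the attribute. The
error values and counts below are hypothetical, and taking S as the sum
of the per-attribute counts is an assumption of this sketch.

```python
def threat_score(cr_j, te_j, s):
    """(1 - Cr_j) * (TE_j / S): low error and many test data give a high score."""
    return (1.0 - cr_j) * (te_j / s)

def pick_attribute(errors, test_counts):
    """Return the attribute with the highest inference threat."""
    s = sum(test_counts.values())  # assumed: S is the total over all attributes
    if s == 0:
        raise ValueError("no confidential attribute values to protect")
    return max(errors, key=lambda a: threat_score(errors[a], test_counts[a], s))

# Hypothetical classification errors Cr_j and test-data counts TE_j.
errors = {"diesels": 0.10, "range": 0.40, "depth": 0.05}
test_counts = {"diesels": 6, "range": 3, "depth": 1}

chosen = pick_attribute(errors, test_counts)
print(chosen)
```

Because the errors change after each modification, the scores would be
recomputed before every run, as the text notes.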

6  Example 

Table 2 is a small sample taken from a Submarine Design
database (98 records and 9 attributes, collected from
Jane’s Naval Weapon Systems) and represents the initial
Low view. Information on the “diesels of S. Cruz” and the
“range of Preveze” is assumed to be sensitive. From the
available information (D), one can infer these two pieces
of sensitive data with probability 1, i.e., Pr(“diesels of S.
Cruz” = medium | D) = 1 and Pr(“range of Preveze”
= short | D) = 1. The inference is excessive.

Table 2: Initial Low database

name      diesels   range     depth
Agosta    low       short     medium
Foxtrot   medium    medium    medium
Seawolf   high      long      deep
S. Cruz   ?         medium    medium
Preveze   low       ?         medium

After downgrading, the modified Low view is shown in
Table 3, where the modified attribute values are the newly
introduced “?”s (the range of S. Cruz and the diesels of
Preveze). The probabilities of these two pieces of
sensitive data become Pr(“diesels of S. Cruz” = medium
| D) = 0.33 and Pr(“range of Preveze” = short | D)
= 0.33, indicating equal likelihood, and the result is
desirable.
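
The probability-1 inferences on the initial Low view can be reproduced
with a simple frequency-matching estimate. The sketch below is not the
decision tree procedure of the paper: the function `estimate` and its
matching rule (compare the known values of a record against the other
records) are assumptions chosen for brevity. It covers only the initial
view and makes no attempt to reproduce the 0.33 figures.

```python
from collections import Counter

# Initial Low view of the submarine sample; "?" marks a hidden value.
LOW_VIEW = {
    "Agosta":  {"diesels": "low",    "range": "short",  "depth": "medium"},
    "Foxtrot": {"diesels": "medium", "range": "medium", "depth": "medium"},
    "Seawolf": {"diesels": "high",   "range": "long",   "depth": "deep"},
    "S. Cruz": {"diesels": "?",      "range": "medium", "depth": "medium"},
    "Preveze": {"diesels": "low",    "range": "?",      "depth": "medium"},
}

def estimate(view, name, target):
    """Relative frequency of `target` values among records that agree
    with all known, non-target values of the record `name`."""
    record = view[name]
    known = {a: v for a, v in record.items() if a != target and v != "?"}
    matches = [
        other[target]
        for other_name, other in view.items()
        if other_name != name
        and other[target] != "?"
        and all(other[a] == v for a, v in known.items())
    ]
    counts = Counter(matches)
    total = sum(counts.values())
    return {value: c / total for value, c in counts.items()} if total else {}

print(estimate(LOW_VIEW, "S. Cruz", "diesels"))
print(estimate(LOW_VIEW, "Preveze", "range"))
```

Only Foxtrot agrees with the known values of S. Cruz, and only Agosta
agrees with those of Preveze, which is why each estimate collapses to a
single value.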


Table 3: Modified Low database

name      diesels   range     depth
Agosta    low       short     medium
Foxtrot   medium    medium    medium
Seawolf   high      long      deep
S. Cruz   ?         ?         medium
Preveze   ?         ?         medium

7  Restoration  Attacks 

If an adversary knows the strategy of inference
prevention, then (s)he may be able to restore the modified
attribute values (referred to as the restoration attack).
In this case, the sensitive data are not properly
protected.

As discussed, sensitive data associated with an
attribute are deemed the classes of the test data when
this attribute is posted as the class-label. For a
decision tree, the root node (attribute) is more likely
to be selected for modification than other nodes
(attributes), because the root node is the most informative
attribute (in terms of Shannon’s entropy measure)
with respect to the class-label. Among many possible
attacks, we consider one type of restoration attack in which
an adversary computes the most informative attribute
w.r.t. a class-label, posts it as the new class-label, and
estimates the associated hidden values. For example,
consider the “voting” data ([4]). In this data set, the
original class-label is “party” and the corresponding most
informative attribute is “physician fee freeze”. It can be
shown that the previously hidden attribute values (e.g., of
“physician fee freeze”) can be restored by using some other
attributes (e.g., “El Salvador aid”). (“El Salvador aid” is
the most informative attribute with respect to “physician
fee freeze”.) To prevent the possible restoration of
modified values, we repeat the process of attribute value
hiding, treating previously modified non-sensitive data as
sensitive, until the restoration risk drops below a
specified threshold. (Of course, this threshold is
incorporated in F.) The need for repeated hiding is
referred to as the ramification problem of data
inference ([2]).
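
The first step of this restoration attack, finding the most informative
attribute with respect to a class-label, can be sketched with Shannon
entropy and information gain. The eight records below are invented toy
rows that merely borrow attribute names from the voting example; they
are not the data of [4], and `other_vote` is a hypothetical attribute.
Excluding the original class-label from the second search is also an
assumption of this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Reduction in entropy of `target` obtained by splitting on `attr`."""
    labels = [r[target] for r in rows]
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

def most_informative(rows, candidates, target):
    """Attribute a decision tree would place at the root for `target`."""
    return max(candidates, key=lambda a: information_gain(rows, a, target))

FIELDS = ("party", "physician_fee_freeze", "el_salvador_aid", "other_vote")
# Invented toy rows, not the real voting data.
RAW = [
    ("R", "y", "y", "y"), ("R", "y", "y", "n"),
    ("R", "y", "y", "y"), ("D", "n", "n", "y"),
    ("D", "n", "n", "n"), ("D", "n", "y", "n"),
    ("D", "n", "n", "y"), ("R", "y", "n", "n"),
]
ROWS = [dict(zip(FIELDS, values)) for values in RAW]

# Step 1: the defender hides the root attribute for the class-label "party".
root = most_informative(ROWS, FIELDS[1:], "party")
# Step 2: the adversary posts that attribute as the new class-label and
# looks for the attribute that best predicts it.
others = [a for a in FIELDS[1:] if a != root]
helper = most_informative(ROWS, others, root)
print(root, helper)
```

If the helper attribute predicts the hidden one well, the hidden values
are at risk, which motivates the repeated hiding described above.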

8  Future  Work 

We will study the effects of inference prevention based
on different modification methods, investigate different
types of statistical attack, and extend the current
model to distributed environments.